
    Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC

    Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memory (NVM) provides a way to build fault-tolerant HPC systems. Data in NVM-based main memory are not lost when the system crashes because of the non-volatile nature of NVM. However, because caches are volatile, data must be logged and explicitly flushed from caches into NVM to ensure consistency and correctness before a crash, which can cause large runtime overhead. In this paper, we introduce an algorithm-based method to establish crash consistency in NVM for HPC applications. We slightly extend application data structures or sparsely flush cache blocks, which introduces negligible runtime overhead. Such extension or cache flushing allows us to use algorithm knowledge to reason about data consistency or to correct inconsistent data when the application crashes. We demonstrate the effectiveness of our method for three algorithms: an iterative solver, dense matrix multiplication, and Monte Carlo simulation. Based on a comprehensive performance evaluation on a variety of test environments, we demonstrate that our approach has very small runtime overhead (at most 8.2% and less than 3% in most cases), much smaller than that of traditional checkpointing, while having the same or lower recomputation cost. Comment: 12 pages
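    As an illustration of the algorithm-directed recovery idea for the iterative-solver case, consider the following hypothetical Python/NumPy sketch (not the paper's code; the real system flushes cache lines into NVM, which is simulated here by an array that survives the crash). A self-correcting iteration such as Jacobi can resume from a partially flushed state because the residual tells us how consistent that state is:

        import numpy as np

        def jacobi_step(A, b, x):
            # One Jacobi sweep: x' = D^{-1} (b - R x), with R = A - D.
            D = np.diag(A)
            return (b - (A - np.diagflat(D)) @ x) / D

        def recover_and_resume(A, b, x_nvm, tol=1e-8):
            # Algorithm-directed recovery: x_nvm is whatever state reached
            # NVM before the crash. No log is replayed; instead the solver's
            # own convergence property repairs stale or inconsistent entries.
            while np.linalg.norm(b - A @ x_nvm) > tol:
                x_nvm = jacobi_step(A, b, x_nvm)
            return x_nvm

    For a diagonally dominant A, this loop converges from any finite starting point, which is the kind of algorithm knowledge the paper exploits to avoid logging.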

    ScalAna: Automating Scaling Loss Detection with Graph Analysis

    Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds - it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs only 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.
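    To make the backtracking idea concrete, here is a minimal, hypothetical Python sketch of root-cause detection on a performance graph; the node layout, the slowdown metric, and the stopping rule are assumptions for illustration, not ScalAna's actual implementation:

        def backtrack_root_cause(graph, t_small, t_large):
            # graph: {node: [dependence predecessors]} built from the program
            # graph; t_small / t_large: per-node time at two process counts.
            def slowdown(n):
                return t_large[n] / max(t_small[n], 1e-9)

            # Start from the node whose time grows worst with scale, then
            # walk dependence edges backward while a predecessor scales even
            # worse; the last node reached is the root-cause candidate.
            node = max(graph, key=slowdown)
            while True:
                worse = [p for p in graph.get(node, [])
                         if slowdown(p) > slowdown(node)]
                if not worse:
                    return node
                node = max(worse, key=slowdown)

    Requiring a strictly worse predecessor guarantees the walk terminates, since the slowdown value increases at every step.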

    Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors

    Funding: This work is supported in part by the China Postdoctoral Science Foundation (Grant No. 2020TQ0169), the ShuiMu Tsinghua Scholar fellowship (2019SM131), the National Key R&D Program of China (2020AAA0105200), the National Natural Science Foundation of China (U20A20226), the Beijing Natural Science Foundation (4202031), the Beijing Academy of Artificial Intelligence (BAAI), and the UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1). This work is also supported by the Royal Academy of Engineering under the Research Fellowship scheme.

    Asymmetric multicore processors (AMPs) offer multiple types of cores under the same programming interface. Extracting the full potential of AMPs requires intelligent scheduling decisions, matching each thread with the right kind of core, the core that will maximize performance or minimize wasted energy for that thread. Existing OS schedulers are not up to this task. While they may handle certain aspects of asymmetry in the system, none can handle all runtime factors affecting AMPs for the general case of multi-threaded, multi-programmed workloads. We address this problem by introducing COLAB, a general-purpose asymmetry-aware scheduler targeting multi-threaded, multi-programmed workloads. It estimates the performance and power of each thread on each type of core and identifies communication patterns and bottleneck threads. With this information, the scheduler makes coordinated core-assignment and thread-selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using both the GEM5 simulator on four distinct big.LITTLE configurations and a development board with ARM Cortex-A73/A53 processors, with mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average, together with an average 5% energy saving, depending on the hardware setup.

    Postprint. Peer reviewed.
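    A toy Python sketch of the coordinated core-assignment decision may help fix the idea; the fields and the heuristic below are invented for illustration and are far simpler than COLAB's actual models:

        from dataclasses import dataclass

        @dataclass
        class Thread:
            tid: int
            app: str
            big_speedup: float    # predicted perf(big core) / perf(little core)
            bottleneck: bool      # e.g., holds a lock other threads wait on

        def assign_cores(threads, n_big, n_little):
            # Bottleneck threads go to big cores first, then the threads that
            # benefit most from one; everything else fills the little cores.
            order = sorted(threads,
                           key=lambda t: (t.bottleneck, t.big_speedup),
                           reverse=True)
            return {"big": order[:n_big],
                    "little": order[n_big:n_big + n_little]}

    A real scheduler would additionally rotate assignments across applications over time so that each application keeps its fair share of processor time, as the abstract requires.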

    Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing

    The recent prevalence of pretrained language models (PLMs) has dramatically shifted the paradigm of semantic parsing, where the mapping from natural language utterances to structured logical forms is now formulated as a Seq2Seq task. Despite the promising performance, previous PLM-based approaches often suffer from hallucination problems due to their neglect of the structural information contained in the sentence, which essentially constitutes the key semantics of the logical forms. Furthermore, most works treat the PLM as a black box in which the generation process of the target logical form is hidden beneath the decoder modules, which greatly hinders the model's intrinsic interpretability. To address these two issues, we propose to augment current PLMs with a hierarchical decoder network. Taking the first-principle structures as semantic anchors, we propose two novel intermediate supervision tasks, namely Semantic Anchor Extraction and Semantic Anchor Alignment, for training the hierarchical decoders and probing the model's intermediate representations in a self-adaptive manner alongside the fine-tuning process. We conduct extensive experiments on several semantic parsing benchmarks and demonstrate that our approach consistently outperforms the baselines. More importantly, by analyzing the intermediate representations of the hierarchical decoders, our approach also takes a significant step toward the intrinsic interpretability of PLMs in the domain of semantic parsing.
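    As a rough sketch of how the two intermediate supervision signals could be combined with the ordinary Seq2Seq objective, consider the following hypothetical PyTorch fragment; the tensor shapes, probe heads, and loss weights are assumptions, not the paper's configuration:

        import torch.nn.functional as F

        def hierarchical_loss(logits, extract_logits, anchor_states,
                              anchor_targets, gold_ids, anchor_labels,
                              w_extract=0.5, w_align=0.5):
            # logits:          (B, T, V)  final logical-form predictions
            # extract_logits:  (B, T, C)  anchor-span tags from a mid decoder layer
            # anchor_states:   (B, T, H)  top-layer states at anchor positions
            # anchor_targets:  (B, T, H)  representations of the gold anchors
            seq2seq = F.cross_entropy(logits.transpose(1, 2), gold_ids)
            # Semantic Anchor Extraction: probe whether each position
            # belongs to a first-principle structure (the anchor).
            extract = F.cross_entropy(extract_logits.transpose(1, 2),
                                      anchor_labels)
            # Semantic Anchor Alignment: pull decoder states toward anchors.
            align = F.mse_loss(anchor_states, anchor_targets)
            return seq2seq + w_extract * extract + w_align * align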

    PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

    Deep neural networks (DNNs) are in critical use across many domains. To accelerate DNN computation, tensor compilers have been proposed to generate efficient code for different domain-specific accelerators. Existing tensor compilers mainly focus on optimizing computation efficiency. However, memory access is becoming a key performance bottleneck because the computational performance of accelerators is increasing much faster than memory performance. The lack of a direct description of memory access and data dependence in current tensor compilers' intermediate representation (IR) makes it significantly harder to generate memory-efficient code. In this paper, we propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators by considering both computation and data-movement optimizations. IntelliGen represents a DNN program using GIR, which includes primitives indicating its computation, data movement, and parallel strategies. This information is further composed into an instruction-level dataflow graph to perform holistic optimizations by searching different memory access patterns and computation operations and generating memory-efficient code on different hardware. We evaluate IntelliGen on an NVIDIA GPU, an AMD GPU, and a Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively, compared to the most performant current frameworks. Comment: 12 pages, 14 figures
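    To give a flavor of what an instruction-level graph IR with explicit data movement might look like, here is a toy Python sketch; the node kinds, fields, and the fused-softmax example are invented for illustration and are not GIR's actual primitives:

        class GirNode:
            # One IR instruction: either a compute primitive or an explicit
            # data-movement primitive between memory scopes.
            def __init__(self, kind, inputs=(), **attrs):
                self.kind = kind            # "load", "compute", or "store"
                self.inputs = list(inputs)  # dataflow edges to producer nodes
                self.attrs = attrs

        def fused_row_softmax():
            # Fused row softmax with the global->shared->global traffic made
            # explicit, so an optimizer can search tilings and move placement.
            ld  = GirNode("load", src="global", dst="shared", tile=(1, 1024))
            mx  = GirNode("compute", [ld], op="row_max")
            ex  = GirNode("compute", [ld, mx], op="sub_exp")
            sm  = GirNode("compute", [ex], op="row_sum")
            out = GirNode("compute", [ex, sm], op="div")
            return GirNode("store", [out], src="shared", dst="global")

    Because loads and stores are first-class nodes in the same dataflow graph as the arithmetic, a holistic search pass can retile or reorder data movement alongside computation instead of treating memory traffic as an afterthought.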